Please install the following packages before you start this tutorial:
Please download or clone the GitHub
repository. This is a public repository, so please don’t commit or push
anything. The tutorial files are located in the files
directory.
For this tutorial, we will analyze a scRNA-seq (10x Genomics) dataset profiling cell composition across a time course of human and chimpanzee cerebral organoid development, from pluripotency to four months, using human embryonic stem cell (H9) and iPSC (409b2) lines and chimpanzee iPSC (Sandra/SandraA) lines (Kanton et al. 2019). This means the dataset contains multiple read count metrics per cell, including developmental time information. Please take a look at the figure below.
Differentiation of human and chimpanzee cerebral organoids.
Load the libraries that we need for today’s tutorial
library("dplyr")
library("Seurat")
We start by reading in the data. The readRDS() function
reads the rds files produced by the Seurat pipeline, returning
an S4 Seurat object. The values in assays represent the read
count tables, with one value for each feature (i.e. gene; row) detected in each
cell (column).
Let’s import the rds files:
hu <- readRDS("files/Tutorial_tc_human_integ_RNAassay.rds")
hu
## An object of class Seurat
## 33694 features across 2000 samples within 1 assay
## Active assay: RNA (33694 features, 4506 variable features)
## 2 dimensional reductions calculated: pca, tsne
ch <- readRDS("files/Tutorial_tc_chimp_integ_RNAassay.rds")
ch
## An object of class Seurat
## 32856 features across 2499 samples within 1 assay
## Active assay: RNA (32856 features, 3680 variable features)
## 2 dimensional reductions calculated: pca, tsne
Explore the Seurat objects. Printing the read count
matrix with, e.g., hu[["RNA"]]@counts will show many
dots, which stand for zero values. To see some actual values, we’ll look at the
first 10 rows and 30 columns of the human dataset and the first 10
rows and columns 200-230 of the chimpanzee dataset.
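The dots in the printed matrices above are simply how R’s Matrix package displays zeros in a sparse dgCMatrix. A minimal standalone illustration with toy data (not the tutorial objects):

```r
library(Matrix)

# Build a tiny sparse count matrix: genes in rows, cells in columns
counts <- sparseMatrix(
  i = c(1, 2, 2),   # row indices of the non-zero entries
  j = c(1, 1, 3),   # column indices of the non-zero entries
  x = c(5, 1, 2),   # the stored values
  dims = c(3, 3),
  dimnames = list(c("GeneA", "GeneB", "GeneC"),
                  c("Cell1", "Cell2", "Cell3"))
)
print(counts)  # zeros are printed as '.', as in hu[["RNA"]]@counts above
```

Only the non-zero entries are stored, which is what makes this representation memory-efficient for scRNA-seq data, where most genes are not detected in most cells.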
head(hu@meta.data, 5)
hu[["RNA"]]@counts[1:10, 1:30]
## 10 x 30 sparse Matrix of class "dgCMatrix"
## [[ suppressing 30 column names 'EC00044', 'EC00060', 'EC00075' ... ]]
##
## RP11-34P13.3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## FAM138A . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## OR4F5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## RP11-34P13.7 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## RP11-34P13.8 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## RP11-34P13.14 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## RP11-34P13.9 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## FO538757.3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## FO538757.2 1 . . . . . . 1 . . . 1 . . 1 . . 1 1 1 . . 1 . . . . . . .
## AP006222.2 . 2 . . . 1 . . . . . . . 1 . . . 1 . . . . . . . . . . . .
head(ch@meta.data, 5)
ch[["RNA"]]@counts[1:10, 200:230]
## 10 x 31 sparse Matrix of class "dgCMatrix"
## [[ suppressing 31 column names 'EC14799', 'EC07259', 'EC07809' ... ]]
##
## MIR1302-2HG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## FAM138A . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## OR4F5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## AL627309.1 . . . . . . . . 1 . . . . . . . . . . . . . . . . . . . . . .
## AL627309.3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## AL627309.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## AL627309.4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## AL732372.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## OR4F29 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## AC114498.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
As you can see, these rds files are already pre-processed:
for instance, the cell lines are already integrated (the line
column has two cell line names when running
head(hu@meta.data, 5)) and the linear/non-linear
dimensional reductions have already been performed
(2 dimensional reductions calculated: pca, tsne is reported
when printing hu). We’ll now split the datasets, so we can
practice the whole process and learn how to run a standard Seurat
clustering pipeline, starting with integration.
Split the dataset into a list of two Seurat objects by cell line.
Split_hu <- SplitObject(hu, split.by = "line")
Split_hu
## $H9
## An object of class Seurat
## 33694 features across 971 samples within 1 assay
## Active assay: RNA (33694 features, 4506 variable features)
## 2 dimensional reductions calculated: pca, tsne
##
## $h409B2
## An object of class Seurat
## 33694 features across 1029 samples within 1 assay
## Active assay: RNA (33694 features, 4506 variable features)
## 2 dimensional reductions calculated: pca, tsne
Split_ch <- SplitObject(ch, split.by = "line")
Split_ch
## $SandraA
## An object of class Seurat
## 32856 features across 559 samples within 1 assay
## Active assay: RNA (32856 features, 3680 variable features)
## 2 dimensional reductions calculated: pca, tsne
##
## $Sandra
## An object of class Seurat
## 32856 features across 1940 samples within 1 assay
## Active assay: RNA (32856 features, 3680 variable features)
## 2 dimensional reductions calculated: pca, tsne
After splitting the dataset by cell line, the next step is to
normalize the data. By default, we employ a global-scaling normalization
method, LogNormalize, that normalizes the feature expression
measurements for each cell by the total expression, multiplies this by a
scale factor (10,000 by default), and log-transforms the result.
Normalized values are stored in
NorFea_hu$H9[["RNA"]]@data.
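The LogNormalize arithmetic can be reproduced by hand on a toy matrix (made-up counts, not the tutorial data): divide each count by its cell’s total, multiply by the scale factor, and apply log(1 + x).

```r
# Toy counts: 3 genes (rows) x 2 cells (columns), column totals are 10 each
counts <- matrix(c(4, 0, 6,
                   1, 9, 0),
                 nrow = 3,
                 dimnames = list(c("G1", "G2", "G3"), c("C1", "C2")))

scale.factor <- 1e4
# Per-cell normalization followed by natural-log(1 + x), as LogNormalize does
norm <- log1p(sweep(counts, 2, colSums(counts), "/") * scale.factor)
norm
```

NormalizeData(x) stores the equivalent result in x[["RNA"]]@data.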
We next calculate a subset of features that exhibit high cell-to-cell
variation in the dataset (i.e., they are highly expressed in some cells,
and lowly expressed in others). Focusing on these genes in downstream
analysis helps to highlight biological signal in single-cell datasets.
This is implemented in the FindVariableFeatures() function.
By default, we return 2,000 features per dataset. These will be used in
downstream analysis, like PCA.
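As a rough intuition for what FindVariableFeatures() does, here is a simplified base-R sketch that ranks toy genes by their raw variance across cells (the actual vst method fits a mean-variance trend and standardizes against it, so this is only an approximation):

```r
set.seed(1)
# Toy normalized expression: 100 genes x 20 cells
expr <- matrix(rnorm(100 * 20), nrow = 100,
               dimnames = list(paste0("Gene", 1:100), paste0("Cell", 1:20)))
expr[1:5, ] <- expr[1:5, ] * 5   # make five genes clearly more variable

gene.var <- apply(expr, 1, var)              # cell-to-cell variance per gene
top.variable <- names(sort(gene.var, decreasing = TRUE))[1:10]
top.variable                                  # the five inflated genes rank near the top
```

The mean-variance trend correction in vst matters on real data because highly expressed genes tend to have higher raw variance even without biological heterogeneity.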
The normalization and identification of variable features will be
done for each dataset independently in one command using
lapply(). By running the
VariableFeaturePlot() function, we can view the variable features.
In this tutorial, we’ll label the 10 most variable features.
NorFea_hu <- lapply(X = Split_hu, FUN = function(x) {
x <- NormalizeData(x)
x <- FindVariableFeatures(x, selection.method = "vst", nfeatures = 2000)
})
NorFea_hu
## $H9
## An object of class Seurat
## 33694 features across 971 samples within 1 assay
## Active assay: RNA (33694 features, 2000 variable features)
## 2 dimensional reductions calculated: pca, tsne
##
## $h409B2
## An object of class Seurat
## 33694 features across 1029 samples within 1 assay
## Active assay: RNA (33694 features, 2000 variable features)
## 2 dimensional reductions calculated: pca, tsne
NorFea_hu$H9[["RNA"]]@data[1:10, 1:30]
## 10 x 30 sparse Matrix of class "dgCMatrix"
## [[ suppressing 30 column names 'EC00044', 'EC00060', 'EC00296' ... ]]
##
## RP11-34P13.3 . . . . . . . . .
## FAM138A . . . . . . . . .
## OR4F5 . . . . . . . . .
## RP11-34P13.7 . . . . . . . . .
## RP11-34P13.8 . . . . . . . . .
## RP11-34P13.14 . . . . . . . . .
## RP11-34P13.9 . . . . . . . . .
## FO538757.3 . . . . . . . . .
## FO538757.2 0.5656387 . . . . . 0.3090438 0.4445882 .
## AP006222.2 . 1.005712 0.4071387 . . . . . .
##
## RP11-34P13.3 . . . . . . . . . . . .
## FAM138A . . . . . . . . . . . .
## OR4F5 . . . . . . . . . . . .
## RP11-34P13.7 . . . . . . . . . . . .
## RP11-34P13.8 . . . . . . . . . . . .
## RP11-34P13.14 . . . . . . . . . . . .
## RP11-34P13.9 . . . . . . . . . . . .
## FO538757.3 . . . . . . . . . . . .
## FO538757.2 0.4108021 0.4861095 . . 0.3577626 . . . . . . .
## AP006222.2 0.4108021 . . . . . . . . 0.6489736 0.4776818 .
##
## RP11-34P13.3 . . . . . . . . .
## FAM138A . . . . . . . . .
## OR4F5 . . . . . . . . .
## RP11-34P13.7 . . . . . . . . .
## RP11-34P13.8 . . . . . . . . .
## RP11-34P13.14 . . . . . . . . .
## RP11-34P13.9 . . . . . . . . .
## FO538757.3 . . . . . . . . .
## FO538757.2 0.5001261 . . 0.4881182 . . . 0.2898991 .
## AP006222.2 0.5001261 . . 0.4881182 . 0.7170238 . 0.2898991 0.3503035
VFP_H9 <- VariableFeaturePlot(NorFea_hu$H9)
VFP_H9 <- LabelPoints(plot = VFP_H9, points = head(VariableFeatures(NorFea_hu$H9), 10), repel = TRUE, xnudge = 0, ynudge = 0)
VFP_h209B2 <- VariableFeaturePlot(NorFea_hu$h409B2)
VFP_h209B2 <- LabelPoints(plot = VFP_h209B2, points = head(VariableFeatures(NorFea_hu$h409B2), 10), repel = TRUE, xnudge = 0, ynudge = 0)
VFP_H9
VFP_h209B2
NorFea_ch <- lapply(X = Split_ch, FUN = function(x) {
x <- NormalizeData(x)
x <- FindVariableFeatures(x, selection.method = "vst", nfeatures = 2000)
})
NorFea_ch
## $SandraA
## An object of class Seurat
## 32856 features across 559 samples within 1 assay
## Active assay: RNA (32856 features, 2000 variable features)
## 2 dimensional reductions calculated: pca, tsne
##
## $Sandra
## An object of class Seurat
## 32856 features across 1940 samples within 1 assay
## Active assay: RNA (32856 features, 2000 variable features)
## 2 dimensional reductions calculated: pca, tsne
NorFea_ch$SandraA[["RNA"]]@data[1:10, 200:230]
## 10 x 31 sparse Matrix of class "dgCMatrix"
## [[ suppressing 31 column names 'EC14799', 'EC07259', 'EC07809' ... ]]
##
## MIR1302-2HG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## FAM138A . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## OR4F5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## AL627309.1 . . . . . . . . 0.4267403 . . . . . . . . . . . . . . . . . . . . .
## AL627309.3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## AL627309.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## AL627309.4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## AL732372.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## OR4F29 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## AC114498.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
##
## MIR1302-2HG .
## FAM138A .
## OR4F5 .
## AL627309.1 .
## AL627309.3 .
## AL627309.2 .
## AL627309.4 .
## AL732372.1 .
## OR4F29 .
## AC114498.1 .
VFP_SA <- VariableFeaturePlot(NorFea_ch$SandraA)
VFP_SA <- LabelPoints(plot = VFP_SA, points = head(VariableFeatures(NorFea_ch$SandraA), 10), repel = TRUE, xnudge = 0, ynudge = 0)
VFP_S <- VariableFeaturePlot(NorFea_ch$Sandra)
VFP_S <- LabelPoints(plot = VFP_S, points = head(VariableFeatures(NorFea_ch$Sandra), 10), repel = TRUE, xnudge = 0, ynudge = 0)
VFP_SA
VFP_S
Now, we see that the number of variable features has been changed to
2000 in each cell line. We will now select features that are repeatedly
variable across datasets for integration.
features_h <- SelectIntegrationFeatures(object.list = NorFea_hu)
head(features_h, n = 10)
## [1] "TTR" "HTR2C" "PCP4" "LGALS1" "DLK1" "CXCL14" "COL1A1" "COL3A1"
## [9] "IGFBP7" "PTN"
features_c <- SelectIntegrationFeatures(object.list = NorFea_ch)
head(features_c, n = 10)
## [1] "APOA1" "APOA2" "COL3A1" "APOB" "IGF2" "TTR" "APOA4" "COL1A1"
## [9] "LUM" "DLK1"
We then identify anchors using the
FindIntegrationAnchors() function, which takes a list of
Seurat objects as input, and use these anchors to integrate
the two datasets together with IntegrateData().
anchors_hu <- FindIntegrationAnchors(object.list = NorFea_hu, anchor.features = features_h)
## Scaling features for provided objects
## Finding all pairwise anchors
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
## Found 2423 anchors
## Filtering anchors
## Retained 2257 anchors
anchors_hu
## An AnchorSet object containing 4514 anchors between 2 Seurat objects
## This can be used as input to IntegrateData.
anchors_ch <- FindIntegrationAnchors(object.list = NorFea_ch, anchor.features = features_c)
## Scaling features for provided objects
## Finding all pairwise anchors
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
## Found 2377 anchors
## Filtering anchors
## Retained 857 anchors
anchors_ch
## An AnchorSet object containing 1714 anchors between 2 Seurat objects
## This can be used as input to IntegrateData.
We then pass these anchors to the IntegrateData()
function, which returns a Seurat object.
The returned object will contain a new Assay, which holds an integrated (or ‘batch-corrected’) expression matrix for all cells, enabling them to be jointly analyzed.
hu_integ <- IntegrateData(anchorset = anchors_hu)
## Merging dataset 1 into 2
## Extracting anchors for merged samples
## Finding integration vectors
## Finding integration vector weights
## Integrating data
hu_integ
## An object of class Seurat
## 35694 features across 2000 samples within 2 assays
## Active assay: integrated (2000 features, 2000 variable features)
## 1 other assay present: RNA
ch_integ <- IntegrateData(anchorset = anchors_ch)
## Merging dataset 1 into 2
## Extracting anchors for merged samples
## Finding integration vectors
## Finding integration vector weights
## Integrating data
ch_integ
## An object of class Seurat
## 34856 features across 2499 samples within 2 assays
## Active assay: integrated (2000 features, 2000 variable features)
## 1 other assay present: RNA
Now, the number of features has also been changed to 2000 in each
object and the active assay is set to integrated. The
original unmodified data still resides in the RNA
assay.
As this integrated assay is already normalized and the variable
features are identified, we can continue with scaling, a linear
transformation that is a standard pre-processing step prior to
dimensional reduction techniques like PCA. The ScaleData()
function shifts the expression of each gene, so that the mean expression
across cells is 0, and scales the expression of each gene, so that the
variance across cells is 1, so highly-expressed genes do not dominate.
The results are stored in
hu_integ[["integrated"]]@scale.data (ScaleData() operates on
the active assay, which is now integrated).
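The effect of this scaling can be demonstrated with base R’s scale() on toy data; note that scale() works column-wise while genes are rows, hence the transposes:

```r
set.seed(42)
# Toy normalized expression: 5 genes x 30 cells
expr <- matrix(rnorm(5 * 30, mean = 3, sd = 2), nrow = 5,
               dimnames = list(paste0("Gene", 1:5), paste0("Cell", 1:30)))

# Per-gene z-scoring, as ScaleData() does: shift to mean 0, scale to variance 1
scaled <- t(scale(t(expr)))

rowMeans(scaled)        # ~0 for every gene
apply(scaled, 1, sd)    # 1 for every gene
```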
hu_integ <- ScaleData(hu_integ, verbose = FALSE)
ch_integ <- ScaleData(ch_integ, verbose = FALSE)
Next, we perform PCA on the scaled data. By default, only the previously determined variable features are used as input and we define the number of PCs to compute and store (50 by default).
hu_integ <- RunPCA(hu_integ, verbose = FALSE)
hu_integ
## An object of class Seurat
## 35694 features across 2000 samples within 2 assays
## Active assay: integrated (2000 features, 2000 variable features)
## 1 other assay present: RNA
## 1 dimensional reduction calculated: pca
ch_integ <- RunPCA(ch_integ, verbose = FALSE)
ch_integ
## An object of class Seurat
## 34856 features across 2499 samples within 2 assays
## Active assay: integrated (2000 features, 2000 variable features)
## 1 other assay present: RNA
## 1 dimensional reduction calculated: pca
Now, the Seurat object contains
1 dimensional reduction: pca.
To overcome the extensive technical noise in any single feature for scRNA-seq data, Seurat clusters cells based on their PCA scores, with each PC essentially representing a ‘metafeature’ that combines information across a correlated feature set. The top principal components therefore represent a robust compression of the dataset. However, how many components should we choose to include? To answer this question, we’ll consider three approaches below.
Seurat provides several useful ways of visualizing both cells and
features that define the PCA, including VizDimReduction(),
DimPlot(), and DimHeatmap(). In particular
DimHeatmap() allows for easy exploration of the primary
sources of heterogeneity in a dataset, and can be useful when trying to
decide which PCs to include for further downstream analyses.
For instance, let’s visualize the first and last PC.
DimHeatmap(hu_integ, dims = c(1,50), cells = 500, balanced = TRUE)
DimHeatmap(ch_integ, dims = c(1,50), cells = 500, balanced = TRUE)
The first PC shows clear heterogeneity, while the last one doesn’t. The cutoff should be somewhere between PC1 and PC50 (wink wink ;))
As the next approach, we randomly permute a subset of the data (1% by default), rerun PCA to construct a ‘null distribution’ of feature scores, and repeat this procedure. We identify ‘significant’ PCs as those that have a strong enrichment of low p-value features. This implements a statistical test based on a random null model, but is time-consuming for large datasets.
Determine the statistical significance of PCA scores using
JackStraw(), then compute the significance of the JackStraw scores using
ScoreJackStraw(). The JackStrawPlot() function provides a
visualization tool for comparing the distribution of p-values for each
PC with a uniform distribution (dashed line). ‘Significant’ PCs will
show a strong enrichment of features with low p-values (solid curve
above the dashed line).
This takes up to 5 min
JS_hu <- JackStraw(hu_integ, dims = 50, num.replicate = 100)
JS_hu <- ScoreJackStraw(JS_hu, dims = 1:50)
JackStrawPlot(JS_hu, dims = c(1, 50))
## Warning: Removed 3149 rows containing missing values (`geom_point()`).
JS_ch <- JackStraw(ch_integ, dims = 50, num.replicate = 100)
JS_ch <- ScoreJackStraw(JS_ch, dims = 1:50)
JackStrawPlot(JS_ch, dims = c(1, 50))
## Warning: Removed 3063 rows containing missing values (`geom_point()`).
Although both the first and last PCs have a significant p-value (< 0.05), PC1 shows a stronger enrichment of features with low p-values. The cutoff should be somewhere between PC1 and PC50 (wink wink ;))
The last approach is an alternative heuristic method generating an
Elbow plot, which ranks the principal components based on
the percentage of variance explained by each one.
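What ElbowPlot() displays can be computed directly from a PCA: the percentage of variance explained per PC. A toy sketch with prcomp() on simulated data carrying a rank-2 signal:

```r
set.seed(7)
# Toy data: 50 cells x 20 features, with a strong 2-dimensional signal plus noise
signal <- matrix(rnorm(50 * 2), ncol = 2) %*% matrix(rnorm(2 * 20), nrow = 2)
dat <- signal * 3 + matrix(rnorm(50 * 20, sd = 0.5), ncol = 20)

pca <- prcomp(dat, scale. = TRUE)
var.explained <- pca$sdev^2 / sum(pca$sdev^2) * 100
round(var.explained[1:5], 1)   # the 'elbow' appears after PC2, where the signal lives
```

Because the simulated signal is two-dimensional, the variance explained drops sharply after PC2; on real data the drop-off is usually more gradual, which is why the hints below matter.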
ElbowPlot(hu_integ, ndims = 50)
ElbowPlot(ch_integ, ndims = 50)
Which PC should be the threshold to define the cutoff?
Hint 1: Some PCs are difficult to distinguish from background noise for a
dataset of this size without prior knowledge; we recommend setting the
cutoff on the higher side. Hint 2: JackStrawPlot() may not
return a clear PC cutoff, so focus on a sharp drop-off in
significance. Hint 3: Look at how many PCs capture the majority of
the true signal in ElbowPlot().
After deciding on the parameter, rerun RunPCA() with the
defined npcs.
hu_integ <- RunPCA(hu_integ, npcs = [number of pca of your choice], verbose = FALSE)
ch_integ <- RunPCA(ch_integ, npcs = [number of pca of your choice], verbose = FALSE)
To cluster the cells, we first construct a KNN graph based on the Euclidean distance in PCA
space, and refine the edge weights between any two cells based on the
shared overlap in their local neighborhoods. This step is performed
using the FindNeighbors() function, which takes as input the
previously defined dimensionality of the dataset (the first x PCs you
decided on).
The FindClusters() function groups cells together, with
the goal of optimizing the standard modularity function, and contains a
resolution parameter that sets the ‘granularity’ of the downstream
clustering, with increased values leading to a greater number of
clusters. The clusters can be found using the Idents()
function.
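The KNN step that FindNeighbors() performs can be sketched in base R: for each cell, find the k nearest cells by Euclidean distance in PC space (the SNN refinement, which reweights edges by neighborhood overlap, is omitted here):

```r
set.seed(3)
# Toy PCA embedding: 10 cells x 4 PCs
emb <- matrix(rnorm(10 * 4), nrow = 10,
              dimnames = list(paste0("Cell", 1:10), paste0("PC", 1:4)))

k <- 3
d <- as.matrix(dist(emb))   # pairwise Euclidean distances in PC space
diag(d) <- Inf              # a cell is not its own neighbor
knn <- t(apply(d, 1, function(row) order(row)[1:k]))
knn                         # row i holds the indices of cell i's 3 nearest cells
```

Seurat then converts this neighbor graph into an SNN graph with Jaccard-weighted edges, which is what the Louvain algorithm in FindClusters() partitions.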
hu_integ <- FindNeighbors(hu_integ, dims = 1:[number of pca of your choice])
hu_integ <- FindClusters(hu_integ, resolution = 0.6)
ch_integ <- FindNeighbors(ch_integ, dims = 1:[number of pca of your choice])
ch_integ <- FindClusters(ch_integ, resolution = 0.6)
## Computing nearest neighbor graph
## Computing SNN
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 2000
## Number of edges: 59562
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9179
## Number of communities: 17
## Elapsed time: 0 seconds
## Computing nearest neighbor graph
## Computing SNN
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 2499
## Number of edges: 75701
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9114
## Number of communities: 14
## Elapsed time: 0 seconds
Look at cluster IDs of the first 5 cells and check how many clusters (=levels) are found.
head(Idents(hu_integ), 5)
## EC00044 EC00060 EC00296 EC00606 EC00704
## 11 11 11 11 11
## Levels: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
head(Idents(ch_integ), 5)
## EC00002 EC00070 EC00096 EC00136 EC00148
## 11 11 11 11 11
## Levels: 0 1 2 3 4 5 6 7 8 9 10 11 12 13
Seurat offers two major non-linear dimensional reduction techniques,
UMAP and tSNE, to visualize and
explore these datasets. The goal of these algorithms is to learn the
underlying manifold of the data in order to place similar cells together
in low-dimensional space. Cells within the graph-based clusters
determined above should co-localize on these dimension reduction plots.
As input to UMAP and tSNE, we’ll use the
same PCs as input to the clustering analysis.
hu_integ <- RunUMAP(hu_integ, reduction = "pca", dims = 1:[number of pca of your choice])
ch_integ <- RunUMAP(ch_integ, reduction = "pca", dims = 1:[number of pca of your choice])
## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session
## 16:15:18 UMAP embedding parameters a = 0.9922 b = 1.112
## 16:15:18 Read 2000 rows and found 20 numeric columns
## 16:15:18 Using Annoy for neighbor search, n_neighbors = 30
## 16:15:18 Building Annoy index with metric = cosine, n_trees = 50
## 0% 10 20 30 40 50 60 70 80 90 100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|
## 16:15:18 Writing NN index file to temp file /var/folders/zv/_22zlcmj58n4_8mg523zflfc0000gn/T//RtmpnOAIRL/file4b583cd65efa
## 16:15:18 Searching Annoy index using 1 thread, search_k = 3000
## 16:15:19 Annoy recall = 100%
## 16:15:19 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
## 16:15:19 Initializing from normalized Laplacian + noise (using irlba)
## 16:15:19 Commencing optimization for 500 epochs, with 77296 positive edges
## 16:15:22 Optimization finished
## 16:15:22 UMAP embedding parameters a = 0.9922 b = 1.112
## 16:15:22 Read 2499 rows and found 20 numeric columns
## 16:15:22 Using Annoy for neighbor search, n_neighbors = 30
## 16:15:22 Building Annoy index with metric = cosine, n_trees = 50
## 0% 10 20 30 40 50 60 70 80 90 100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|
## 16:15:22 Writing NN index file to temp file /var/folders/zv/_22zlcmj58n4_8mg523zflfc0000gn/T//RtmpnOAIRL/file4b581df58e4a
## 16:15:22 Searching Annoy index using 1 thread, search_k = 3000
## 16:15:23 Annoy recall = 100%
## 16:15:23 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
## 16:15:24 Initializing from normalized Laplacian + noise (using irlba)
## 16:15:24 Commencing optimization for 500 epochs, with 96180 positive edges
## 16:15:27 Optimization finished
DimPlot(hu_integ, reduction = "umap", group.by = "line")
DimPlot(hu_integ, reduction = "umap", split.by = "line")
DimPlot(hu_integ, reduction = "umap")
DimPlot(ch_integ, reduction = "umap", group.by = "line")
DimPlot(ch_integ, reduction = "umap", split.by = "line")
DimPlot(ch_integ, reduction = "umap")
hu_integ <- RunTSNE(hu_integ, reduction = "pca", dims = 1:[number of pca of your choice])
ch_integ <- RunTSNE(ch_integ, reduction = "pca", dims = 1:[number of pca of your choice])
DimPlot(hu_integ, reduction = "tsne", group.by = "line")
DimPlot(hu_integ, reduction = "tsne", split.by = "line")
DimPlot(hu_integ, reduction = "tsne")
DimPlot(ch_integ, reduction = "tsne", group.by = "line")
DimPlot(ch_integ, reduction = "tsne", split.by = "line")
DimPlot(ch_integ, reduction = "tsne")
## Exercises
Try to cluster the cells with a different number of PCs (10, 15, or even 50!). Do you observe a dramatic difference?
What differences do you observe between UMAP and tSNE plot?
Markers define clusters via differential expression. By default,
FindMarkers() identifies positive and negative markers of a single cluster (specified
in ident.1), compared to all other cells. FindAllMarkers()
automates this process for all clusters, but you can also test groups of
clusters vs. each other, or against all cells. The
FindAllMarkers() function has three important arguments
which provide thresholds for determining whether a gene is a marker:
min.pct: This determines the fraction of cells in
either group in which the feature has to be expressed to be included in the
analysis. It’s meant to remove lowly expressed genes. The default is 0.1,
meaning at least 10% of cells in cluster x or in all other clusters
must express the gene. The lower the value, the longer the computing time.
If this is set too high, many true markers could be missed, because
not all expressed genes are detected in every cell.
min.diff.pct: The minimum difference between the
percent of cells expressing the gene in the cluster and the percent of
cells expressing the gene in all other clusters combined. The default is
-Inf, meaning no difference threshold is set. The lower the value, the
longer the computing time.
logfc.threshold: The minimum log2 fold change of the average
expression of a gene in the cluster relative to its average expression in all
other clusters combined. The default is 0.25, meaning the average
log2 fold change must be at least 0.25. The lower the value, the longer
the computing time. If this is set too high, many weak signals could be
missed.
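The quantities these thresholds act on can be computed by hand for a single gene (toy numbers; the fold-change formula below, a mean of expm1 values plus a pseudocount of 1, approximates what recent Seurat versions compute, so treat the exact values as illustrative):

```r
# Toy log-normalized expression of one gene: cluster x vs all other cells
in.cluster  <- c(2.1, 0, 1.8, 2.5, 0, 1.2)
out.cluster <- c(0, 0, 0.3, 0, 0.2, 0, 0, 0.1)

pct.1 <- mean(in.cluster > 0)    # fraction of expressing cells in the cluster
pct.2 <- mean(out.cluster > 0)   # fraction of expressing cells elsewhere
log2fc <- log2((mean(expm1(in.cluster)) + 1) /
               (mean(expm1(out.cluster)) + 1))

# min.pct keeps the gene if pct.1 or pct.2 passes the threshold;
# min.diff.pct additionally requires pct.1 - pct.2 to pass;
# logfc.threshold requires the absolute log2 fold change to pass
c(pct.1 = pct.1, pct.2 = pct.2, log2FC = round(log2fc, 2))
```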
For performing differential expression after integration, we switch back to the original RNA assay.
DefaultAssay(hu_integ) <- "RNA"
hu_integ
## An object of class Seurat
## 35694 features across 2000 samples within 2 assays
## Active assay: RNA (33694 features, 0 variable features)
## 1 other assay present: integrated
## 3 dimensional reductions calculated: pca, umap, tsne
DefaultAssay(ch_integ) <- "RNA"
ch_integ
## An object of class Seurat
## 34856 features across 2499 samples within 2 assays
## Active assay: RNA (32856 features, 0 variable features)
## 1 other assay present: integrated
## 3 dimensional reductions calculated: pca, umap, tsne
Now, find markers for every cluster compared to all remaining cells, report only the positive ones.
Markers_hu <- FindAllMarkers(hu_integ, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
## Calculating cluster 0
## Calculating cluster 1
## Calculating cluster 2
## Calculating cluster 3
## Calculating cluster 4
## Calculating cluster 5
## Calculating cluster 6
## Calculating cluster 7
## Calculating cluster 8
## Calculating cluster 9
## Calculating cluster 10
## Calculating cluster 11
## Calculating cluster 12
## Calculating cluster 13
## Calculating cluster 14
## Calculating cluster 15
## Calculating cluster 16
Markers_hu
Markers_ch <- FindAllMarkers(ch_integ, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
## Calculating cluster 0
## Calculating cluster 1
## Calculating cluster 2
## Calculating cluster 3
## Calculating cluster 4
## Calculating cluster 5
## Calculating cluster 6
## Calculating cluster 7
## Calculating cluster 8
## Calculating cluster 9
## Calculating cluster 10
## Calculating cluster 11
## Calculating cluster 12
## Calculating cluster 13
Markers_ch
Check that no marker has an average log2 fold change lower than 0.25, and count how many markers are detected in each cluster.
sum(Markers_hu$avg_log2FC < 0.25)
## [1] 0
Markers_hu %>% count(cluster)
sum(Markers_ch$avg_log2FC < 0.25)
## [1] 0
Markers_ch %>% count(cluster)
Let’s take a look at the marker expression. For that, we’ll collect the top 2 markers (by average log2 fold change) per cluster.
top2_hu <- Markers_hu %>%
group_by(cluster) %>%
top_n(n = 2, wt = avg_log2FC)
top2_hu
top2_ch <- Markers_ch %>%
group_by(cluster) %>%
top_n(n = 2, wt = avg_log2FC)
top2_ch
The Seurat package provides several tools for visualizing
marker expression. VlnPlot() (which shows expression probability
distributions across clusters) and FeaturePlot()
(which visualizes feature expression on a tSNE or PCA plot) are our most
commonly used visualizations. RidgePlot(),
CellScatter(), and DotPlot() are also
available.
Here, we look at the expression levels of the top 2 markers
in cluster 0. When using
FeaturePlot(), the default reduction is umap.
VlnPlot(hu_integ, features = top2_hu %>% filter(cluster %in% "0") %>% pull(-1))
VlnPlot(ch_integ, features = top2_ch %>% filter(cluster %in% "0") %>% pull(-1))
RidgePlot(hu_integ, features = top2_hu %>% filter(cluster %in% "0") %>% pull(-1))
## Picking joint bandwidth of 0.0144
## Picking joint bandwidth of 0.022
RidgePlot(ch_integ, features = top2_ch %>% filter(cluster %in% "0") %>% pull(-1))
## Picking joint bandwidth of 0.0478
## Picking joint bandwidth of 0.224
FeaturePlot(hu_integ, features = top2_hu %>% filter(cluster %in% "0") %>% pull(-1))
DimPlot(hu_integ, reduction = "umap")
FeaturePlot(ch_integ, features = top2_ch %>% filter(cluster %in% "0") %>% pull(-1))
DimPlot(ch_integ, reduction = "umap")
FeaturePlot(hu_integ, features = top2_hu %>% filter(cluster %in% "0") %>% pull(-1), reduction = "tsne")
DimPlot(hu_integ, reduction = "tsne")
FeaturePlot(ch_integ, features = top2_ch %>% filter(cluster %in% "0") %>% pull(-1), reduction = "tsne")
DimPlot(ch_integ, reduction = "tsne")
Using DotPlot(), we can get a more comprehensive overview
of marker expression by providing the top 2 markers of all
clusters.
DotPlot(hu_integ, features = unique(top2_hu %>% pull(-1)), cols = c("blue", "red"), dot.scale = 8) +
RotatedAxis()
DotPlot(ch_integ, features = unique(top2_ch %>% pull(-1)), cols = c("blue", "red"), dot.scale = 8) +
RotatedAxis()
Play with the three major arguments of the FindAllMarkers()
function! For each task, check the computing time with
Sys.time() and visualize the marker expression with a
different number of top markers (2, 5, or even 10!).
Try to include more genes per cluster by adjusting the minimum value of the average log2 fold change. Does it take longer than the code in the tutorial? How many markers are detected in each cluster? Do clusters contain more specific markers?
Try to include fewer genes per cluster by excluding lowly expressed genes, requiring at least 50% of cells in cluster x or in all other clusters to express the gene. Does it take longer than the code in the tutorial? How many markers are detected in each cluster? Do clusters contain more specific markers?
Try to include more cluster-specifically expressed markers by adjusting the minimum difference between two groups. Does it take longer than the code in the tutorial? How many markers are detected in each cluster? Do clusters contain more specific markers?
After identifying cell type markers in each cluster, it’s also important to assign a cell type identity to each cluster. Matching canonical markers to known cell types is still challenging. Depending on the source of the cells, e.g. brain or heart, or the species, e.g. human or Drosophila, there are various databases and studies for markers. In this tutorial, we’ll use a table containing cell types and corresponding markers based on a previous scRNA-seq study.
Import the table containing the marker information.
CTmarkers <- read.table("files/markers.txt", sep = "\t", header = T)
dim(CTmarkers)
## [1] 9871 2
head(CTmarkers)
We’ll collect the top 10 markers (by average log2 fold change) per cluster and compare these with the imported table.
top10_hu <- Markers_hu %>%
group_by(cluster) %>%
top_n(n = 10, wt = avg_log2FC)
top10_hu
top10_ch <- Markers_ch %>%
group_by(cluster) %>%
top_n(n = 10, wt = avg_log2FC)
top10_ch
We’ll now create a custom function to find the best-matching cell type for our markers. For each cluster, the function checks how often the top 10 markers match certain cell types based on the imported table. The most frequently matched cell type for each cluster is returned.
CTretrieve <- function(x){
cluster <- c()
for(i in 1:length(levels(x$cluster))){
top10 <- x %>% filter(cluster %in% levels(x$cluster)[i]) %>% pull(-1)
cluster <- c(cluster, CTmarkers %>%
filter(gene %in% top10) %>%
count(cluster) %>%
arrange(desc(n)) %>%
slice(1) %>%
pull(1))
}
return(cluster)
}
CTretrieve(top10_hu)
## [1] "cortical neurons 1" "ventral progenitors and neurons 1"
## [3] "G2M/S dorsal progenitors 1" "neuroectodermal-like cells"
## [5] "NSC/radial glia" "NSC"
## [7] "G2M/S dorsal progenitors 2" "stem cells 1"
## [9] "ventral progenitors and neurons 3" "stem cells 1"
## [11] "G2M/S ventral progenitors and IPs" "stem cells 1"
## [13] "gliogenic/outer radial glia" "choroid plexus"
## [15] "midbrain/hindbrain" "mesenchymal-like cells"
## [17] "mesenchymal-like cells"
CTretrieve(top10_ch)
## [1] "cortical neurons 1" "G2M/S dorsal progenitors 2"
## [3] "G2M/S dorsal progenitors 2" "cortical neurons 1"
## [5] "IPs and early cortical neurons" "stem cells 1"
## [7] "ventral progenitors and neurons 3" "mesenchymal-like cells"
## [9] "cortical neurons 2" "mesenchymal-like cells"
## [11] "mesenchymal-like cells" "stem cells 1"
## [13] "neuroectodermal-like cells" "gliogenic/outer radial glia"
With this, we can rename our clusters. So far, they were 0, 1, 2, …
new.cluster.ids_hu <- CTretrieve(top10_hu)
names(new.cluster.ids_hu) <- levels(hu_integ)
hu_integ <- RenameIdents(hu_integ, new.cluster.ids_hu)
DimPlot(hu_integ, reduction = "umap")
DimPlot(hu_integ, reduction = "tsne")
new.cluster.ids_ch <- CTretrieve(top10_ch)
names(new.cluster.ids_ch) <- levels(ch_integ)
ch_integ <- RenameIdents(ch_integ, new.cluster.ids_ch)
DimPlot(ch_integ, reduction = "umap")
DimPlot(ch_integ, reduction = "tsne")
Improve this approach. You can either rewrite the function or come up
with a completely different way to achieve the goal. The goal is to
assign the top 10 markers in top10_hu and
top10_ch to cell types of CTmarkers and select
the cell type to which most of the 10 markers are assigned.
CTretrieve <- function(x){
cluster <- c()
for(i in 1:length(levels(x$cluster))){
top10 <- x %>% filter(cluster %in% levels(x$cluster)[i]) %>% pull(-1)
cluster <- c(cluster, CTmarkers %>%
filter(gene %in% top10) %>%
count(cluster) %>%
arrange(desc(n)) %>%
slice(1) %>%
pull(1))
}
return(cluster)
}